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ABSTRACT 

Peptide recognition domains and transcription 
factors play crucial roles in cellular signaling. They 
bind linear stretches of amino acids or nucleotides, 
respectively, with high specificity. Experimental 
techniques that assess the binding specificity of 
these domains, such as microarrays or phage 
display, can retrieve thousands of distinct ligands, 
providing detailed insight into binding specificity. 
In particular, the advent of next-generation 
sequencing has recently increased the throughput 
of such methods by several orders of magnitude. 
These advances have helped reveal the presence 
of distinct binding specificity classes that co-exist 
within a set of ligands interacting with the same 
target. Here, we introduce a software system 
called MUSI that can rapidly analyze very large 
data sets of binding sequences to determine the 
relevant binding specificity patterns. Our pipeline 
provides two major advances. First, it can detect 
previously unrecognized multiple specificity 
patterns in any data set. Second, it offers 
integrated processing of very large data sets from 
next-generation sequencing machines. The results 
are visualized as multiple sequence logos 
describing the different binding preferences of the 
protein under investigation. We demonstrate the 
performance of MUSI by analyzing recent phage 
display data for human SH3 domains as well as 
microarray data for mouse transcription factors. 



INTRODUCTION 

The wiring diagram of cellular signaling pathways is 
formed by specific molecular interactions involving 
proteins, DNA and other molecules (1,2). Among these, 
signaling protein-protein interactions typically consist of 
protein domains [such as kinases (3-5), SH3 (6) or PDZ 
(7,8)] binding short unstructured regions on their target 
proteins. These regions are characterized by very specific 
linear sequence motifs that are recognized by the domain 
they bind to. For instance, SH3 domains are known to 
target PxxP motifs with a positively charged residue 
either on the left (Class I, [R/K]xxPxxP), or on the right 
(Class II, PxxP[R/K]) of the proline-rich region (6). 
Similarly, DNA binding domains of transcription factors 
(TF) make direct contact with short stretches of nucleo- 
tides that display high sequence specificity (9). This speci- 
ficity is crucial for enabling proteins to interact selectively 
with their cognate partners within the crowded intra- 
cellular environment. Detailed understanding of binding 
specificity encoded in these motifs is very powerful to 
accurately predict novel interactions (4,10-13) and for 
the design of new inhibitor compounds (14). 

Various technologies, such as microarrays (12,15,16), 
SPOT arrays (17), phospho-proteomics arrays (18), or 
phage display (19), have been designed to characterize 
the binding specificity of protein domains and transcrip- 
tion factors. Data from these experiments enable compu- 
tational models to describe binding specificity. One 
well-known such model is the Position Weight Matrix 
(PWM, also known as Position-Specific Scoring Matrix). 
This model has been widely applied to characterize the 
binding specificity of both peptide recognition domains 
and transcription factors (20-23). However, several 
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recent studies suggest that the use of single PWMs leads to 
a reductive view of binding specificity, since a PWM does 
not consider correlations between different ligand 
positions (5,16,24,25). To overcome this limitation, differ- 
ent strategies have been developed based on neural 
networks (5), hidden Markov models (25) or clustering 
(24,26). The latter describes binding specificity with 
multiple PWMs corresponding to clusters of ligands that 
follow the same specificity. The results of such analysis can 
be readily visualized as multiple sequence logos. Clear 
examples of multiple specificity were encountered in 
several peptide recognition domain families (24), as well 
as in transcription factors (16). 

Most of these computational tools work efficiently with 
up to a few hundred ligands. However, recent technologic- 
al advances have increased the throughput of the afore- 
mentioned experimental methods by several orders of 
magnitude. In particular, combining the power of phage 
display with next-generation sequencing currently enables 
the retrieval of thousands of different ligands binding to 
the same domain (27,28). This deluge of data represents 
both a challenge and an opportunity. On the one hand, it 
requires more efficient and faster processing systems. On 
the other hand, it enables analysis at greater resolution, 
such as distinguishing between different multiple binding 
specificities. Here, we introduce the integrated system 
Multiple Specificity Identifier (MUSI) that addresses 
both these issues, enabling high-throughput analysis of 
large data sets and detecting novel multiple specificity. 
MUSI provides a simple interface for processing short 
peptide or nucleic acid sequence data. Starting from a 
set of sequences observed to bind to a given target, it 
automatically generates an optimal number of PWMs 
based on the different specificity patterns present in 
the data. The results are graphically displayed in a table 
of sequence logos (Figure 1). These are useful for 
visualizing the different binding specificities. The numer- 
ical values of the different PWMs are also provided so that 
the user can quantitatively compare them, or use them to 
predict protein-protein or protein-DNA interactions. We 
expect MUSI to be particularly relevant to researchers 
working with phage-display, peptide arrays, protein 
binding microarrays or similar high-throughput 
technologies to map binding specificity of protein inter- 
action domains, RNA-binding proteins or transcription 
factors. 



MATERIALS AND METHODS 

MUSI aims to provide a robust and user-friendly interface 
to identify multiple specificity within a set of sequences 
(either peptides or nucleic acids). Usually, these sequences 
would share some common property, such as binding to 
the same target. The algorithm behind MUSI is based on 
the idea of fitting several linear probabilistic models (in 
our case, PWMs) to a set of sequences to optimally 
describe the different specificity classes. It relies on the 
mathematical tools of mixture models and uses a 
Maximum Likelihood approach with Dirichlet priors for 
fitting. It accepts a variety of input formats from 



raw Illumina sequencing reads (FASTQ) to peptide or 
nucleotide sequences in standard FASTA files. The 
program can be executed from the command line or 
with a Graphical User Interface (GUI). 

Preprocessing of the data 

First, a set of N unique sequences is generated 
by removing duplicates, since multiple occurrences of the 
same sequence are often not very informative in terms of 
specificity and could originate from experimental biases 
(e.g. artificial amplification of the same ligand along 
experimental procedure). These sequences are then 
aligned using MAFFT, a sequence alignment tool (29). 
In the case of short ligands binding to a linear epitope, 
internal gaps are not very structurally meaningful and 
may even prevent relavant specificities from being distin- 
guished. Hence, we use a large gap opening penalty that is 
iteratively increased until internal gaps are eliminated. 
Both the redundancy removal and alignment steps can 
be optionally skipped in MUSI. Special options are avail- 
able for FASTQ data pre-processing (described later). 

Algorithm 

Starting from a set of TV aligned sequences, we use a 
mixture model to identify multiple PWMs (24). Aligned 
sequences are modeled as strings of M letters, taken from 
an alphabet of size S (S = 20 for proteins and S = 4 for 
DNA or RNA). In this model, the specificity is described 
with K different PWMs where d\ corresponds to the prob- 
ability of residue or nucleotide i at position / according to 
the kih PWM. A weight n k is also associated with each 
PWM. The goal of the algorithm is therefore to identify 
the optimal parameters 0\ and tj^. For a given K, this 
optimization is carried out using standard Maximum 
Likelihood with the Expectation-Maximization (EM) 
algorithm (24,26). For each optimization, we generate 
10 random initial configurations (i.e. 10 random assign- 
ments of the sequences to K groups) and keep the one that 
gives the highest log-likelihood value. 

The problem of finding an optimal value for K is more 
challenging and several different methods have been 
designed, such as the Bayesian Inference Criterion or 
Kolmogorov-Smirnov tests (30). Here, the number of 
PWMs is automatically determined by the algorithm in 
the following way. Starting from K = 1 PWM, we itera- 
tively increment K by one and run the mixture model as 
described previously. The new configuration with K+ 1 
PWMs is accepted if: 

(i) Each PWM has a weight if larger than P = 0.01 
and larger than 5/N (if N> 5). 

(ii) For any pair of PWMs the Euclidean distance be- 
tween the probabilities of at least two positions 
is larger than a cut-off distance D. In 
other words, for all (k, k'), k^k ', there exist at 
least two distinct positions l x and l 2 such that 
(Eli (0\-e^ 2 f 5 >D and (Ef =1 (0% -Ofj 2 ) 0 ' 5 >D. 

(m) K<K max 

Condition (i) ensures that each specificity is represented 
by the minimal number of sequences in the input data. 
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Figure 1. General diagram of MUSI describing the steps from sequence files to multiple sequence logos. With typical FASTA input, MUSI optionally 
removes redundant sequences, then runs multiple sequence alignment. It then predicts multiple binding specificity. Finally, it generates logos (Examples: 
Src SH3 domain, and smad3 TF). When handling raw and barcoded sequencing reads (FASTQ) (for multiplexed applications), it begins by filtering 
and sorting by barcodes. The red arrows indicate different starting points that can be selected by the user depending on the kind of input data. 



This is important to prevent spurious PWMs supported by 
only one or two sequences that most likely result from 
sub-optimal maxima in the EM optimization or false- 
positives in the experimental data. Condition (ii) 
prevents redundancy among the different specificities 
identified by the algorithm by ensuring that any pair of 
PWMs has at least two positions with significantly differ- 
ent specificities. This typically results from positional 
correlations between these positions, which are known 
to underlie multiple specificity (24). The maximal 
number of PWMs (K mSLX ) in condition (iii) is set to 10 by 
default, but can be modified by the user. To prevent the 
algorithm from terminating too early because of criterion 
(ii), we chose a relatively small value of D = 0.5. As a last 
post-processing step, we merge pairs of PWMs with less 
than two positions with a Euclidean distance larger than 
D' = 0.63. These cut-off values were manually chosen 
based on our experience with the method in order to 
ensure that multiple PWMs are not redundant. Different 
choices for P, D and D' do not significantly affect the per- 
formance of the algorithm, as shown in Supplementary 
Table SI. Alternatively, the number of PWMs can also 
be fixed by the user. 

The time complexity of this algorithm scales as 
0(NMK) if the number of PWMs is fixed and 0(NMK 2 ) 



if it is decided by the algorithm. In particular, it runs much 
faster than a previous method (24) that relied on a clus- 
tering of the ligands scaling typically as 0(N 2 ). As such it 
can be applied to very large data sets consisting of thou- 
sands of sequences in a few seconds (see 'Results' section). 

Output 

The results of the mixture model are used to draw 
sequence logos corresponding to each specificity and 
these can be displayed in a table such as the one shown 
in Figure 1 . Additionally, the numerical values of both the 
single and multiple PWMs, together with their weights, 
can be exported for subsequent analysis, such as genome 
or proteome scanning to predict protein-DNA or protein- 
protein interactions. For each sequence in the input set, we 
also provide its probability with respect to each of the 
PWMs (so-called responsibilities). These are useful to 
estimate which of the different specificity classes each 
ligand belongs to. 

Pipeline requirements 

MUSI is written in Perl and C++. It uses the Biologically 
Relevant Analysis of Interaction Networks (BRAIN) 
library, which is built on top of Biojava (31), to generate 
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the sequence logos (PDF and PNG files) (32). It uses 
MAFFT (29) to align the sequences. Both are integrated 
in the system. The Graphical User Interface (GUI) for 
Mac OS X is a Cocoa wrapper for perl scripts, written 
in Objective C. For PDF table generation, pdfLaTeX is 
required. 

The pipeline is available for download (http://www 
.kimlab.org/software/musi) and runs locally on most 
platforms (Mac OS X, Linux and UNIX, the GUI 
version is only available on Mac OS X). Detailed instal- 
lation instructions are provided on the website. 

Processing raw reads from an Illumina sequencer 

A set of special input options is included in MUSI to 
process FASTQ sequence data labeled by barcodes. 
These features enable straightforward processing of data 
obtained from Illumina sequencing. Supplementary 
Figure SI summarizes the different steps described here. 

Filtering reads using the Phred quality score. To obtain 
high-quality results with minimal noise from sequencing, 
the sequences are filtered based on quality scores. To 
measure sequencing quality, our system averages the 
Phred scores within each sequence (33). Users can select 
separate minimum average Phred scores for the sequence, 
forward barcode and reverse barcode. This is because the 
quality of reads tends to decrease toward the end of each 
sequence. The default threshold for all three sections is 25 
(99.7% sequencing accuracy) 

Sort by barcodes. Multiplexing using barcodes is common- 
ly used to make optimal use of sequencing capacity. 
Hence, MUSI supports multiplexing, by sorting processed 
reads according to a user-supplied list of barcodes. In this 
step, the system sorts reads from different barcodes into 
separate files to process many domains at once. 

After sorting and filtering, MUSI continues processing 
as described earlier for FASTA files. 

Experimental data sets 

All sets of nucleotides binding to 305 different transcrip- 
tion factors and homeodomains come from three studies 
based on protein binding microarrays (16,34,35) and were 
downloaded from the UniPROBE website (36). By testing 
all possible 8-mers, this technique generates both positive 
and negative data. We used a cut-off values of 0.35 on the 
enrichment score to define positives and of —0.3 to define 
negative data, To map the specificity of the human Src 
SH3 domain, we used phage-display technology (19), 
where very large libraries of random peptides can be ex- 
pressed on the surface of phage particles. The phage 
colonies were then sequenced with Illumina deep 
sequencing. In this way, a total of 2457 unique peptides 
were identified to bind to Src SH3 domain (raw data for 
this domain can be downloaded from http://www.kimlab 
. org/software/musi) . 

Comparison with BEEML 

To compare MUSI with BEEML (37), we have down- 
loaded intensity information (normalized intensities and 



60-mer probe sequences) and seed PWMs from 
UniPROBE (36) for each protein from Berger et al. (35). 
After running BEEML on each pair of inputs, we have 
used the PWMs from BEEML to perform cross-validation 
in the same way as cross-validation of PWMs from 
UniPROBE. 



RESULTS 

Cross-validation 

To probe the accuracy of the different PWMs generated 
by MUSI, we performed standard 10-fold cross- 
validation. We used the data from UniPROBE (36) con- 
sisting of in vitro DNA sequences binding to different 
transcription factors generated with protein binding 
microarrays (16,34,35). We split the positives and nega- 
tives into 10 groups for cross-validation. The multiple 
PWMs were generated with MUSI based on 9 of the 10 
groups (training set), and used to compute the score of the 
sequences in the last group (testing set). As MUSI does 
not incorporate information about negative data sets, only 
positive examples were used to build the multiple PWMs. 
The procedure was repeated 10 times for each domain, 
each time using a different group of positives and nega- 
tives for the testing set. Figure 2 shows the average of 
Receiver Operating Curve (ROC) over all transcription 
factors and all 10 cross-validation runs. We compared 
MUSI results with the PWMs generated by the BEEML 
method (37), the MEME software (with the maximum 
number of motifs set to five) (38), as well as the ones 
provided on the UniPROBE website itself. Both MEME 
and UniPROBE methods are able to detect cases of 
multiple specificity. All four methods performed well in 
terms of cross-validation, with MUSI still giving statistic- 
ally significantly higher Area Under the ROC (AUC) 
values (Figure 2 and Supplementary Figures S2-S4). 
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and MEME) (Mean AUC: 0.9721, 0.9167 and 0.9192, respectively). 
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CPU performance 

A crucial requirement for MUSI is to perform efficiently 
with large data sets. Figure 3 shows the results in CPU 
time for different sets of nucleic acids and peptides. Each 
point represents either a nucleotide or peptide data set 
(MUSI is blue and MEME is red). The CPU time of 
MUSI including all steps from redundancy removal and 
alignment to logo generation grows linearly with the size 
of the input, enabling the processing of tens of thousands 
of unique sequences in a short time. Comparison with the 
MEME and the method of (24) (Figure 3) as well as 
BEEML (Supplementary Figure S4) software shows that 
MUSI runs orders of magnitude faster than these two 
methods. 

Using MUSI as a denoising tool 

A common issue with high-throughput experimental 
techniques is the high rate of false positives. It is therefore 
useful to understand how MUSI performs on noisier data 
sets. To investigate this, we used the experimental phage 
display data for the human SH3 domain and mixed them 
with increasingly higher numbers of randomly generated 
peptides (excluding random peptides that by chance were 
already present in the initial set). We observed that, in 
general, the presence of false-positives yields one addition- 
al completely unspecific PWM (Supplementary Figure S5) 
that contains almost all randomly generated peptides. 

Using MUSI with longer sequences 

Another important issue can arise with the length of the 
input sequences, since motifs spread out within long 
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sequences are much more difficult to automatically 
identify. To test this possible limitation, we added 
random flanking sequences on both sides of the phage 
display peptide data for the human SH3 domain. We 
observed that multiple specificity is correctly detected as 
long as the total size of the flanking sequences remains 
<40 (Supplementary Figure S6). This clearly exceeds the 
size of the motif itself. However, we stress that MUSI 
works optimally with sequences that are relatively short 
to enable the motifs to be accurately detected. 

Human SH3 domain multiple specificity 

The novel phage display data obtained for Src SH3 
domain reveal how several interesting features can be 
detected with MUSI (Figure 4A). SH3 domains have 
long been classified into two main specificity classes 
summarized with the motifs [R/K]xxPxxP (Class I) and 
PxxPx[R/K] (Class II) (6). From biochemical and crystal- 
lographic studies, SRC SH3 domain is known to accom- 
modate ligands of both classes. This can also be seen in the 
multiple specificity profiles of Figure 4A. Interestingly, 
SRC SH3 domain binds ASAP1 and this interaction is 
critical for the ASAP1 phosphorylation and membrane 
trafficking role (39). Although ASAP1 contains several 
proline rich regions, only two of them are known to 
bind SRC SH3 domains (39). The sequences of these 
two regions [(UniProt identifier: Q9ULH1) 791-797: 
PPLPPRN and 892-898: RVLPKLP] match exactly the 
two specificity profiles predicted by MUSI (see Figure 4a) 
and correspond to the two best hits of the multiple PWMs 
model along the ASAP1 sequence, while the single 
PWM failed to identify the first binding site. This 
further highlights how careful distinction between differ- 
ent specificities can improve biological insights. 

Transcription factor multiple specificity 

MUSI can also detect multiple binding specificity from 
protein binding microarray data on transcription factors 
(Figure 4B). For example, Smad3 (Mothers against 
decapentaplegic homolog3) is a well-known transcription 
factor, which is one of the modulators in the TGF-(3 sig- 
naling pathway (40). The Smad protein family plays an 
essential role in the intracellular signaling of transforming 
growth factor- (3 (TGF-(3), which initiates various cellular 
responses (41). In particular, Smad3 is known to bind 
CAGA box [AG(C/A)CAGACA] within the human 
PAI-1 promoter as its binding sites, which mediates 
TGF-(3-transcriptional induction and other down-stream 
stimulations (41). Furthermore, its other known binding 
motif is a GC-rich motif (42,43). MUSI is able to detect 
the two previously reported motifs. 



DISCUSSION 

Efficient analysis and visualization of large-scale experi- 
mental data is a crucial way of gaining novel insights into 
biological systems. For large data sets of peptides or nu- 
cleotides interacting with proteins, sequence logos enable 
very useful and intuitive visualization. Multiple logos, 
such as the ones used in (24) and (16) provide a natural 



e47 Nucleic Acids Research, 2012, Vol. 40, No. 6 



Page 6 of 8 



Single PWM 



Multiple PWMs 



eOP*p FOP.^p ^jlPj 



Src 



Src-PWMI (0.907) 



Src - PWM2 (0.093) 



B 



Smad3 



Smad3 - PWM1 (0.450) 



Smad3 - PWM2 (0.280) 



Smad3 - PWM3 (0.200) Smad3 - PWM4 (0.05) 

Smad3 - PWM5 (0.02) 

Figure 4. Example sequence logos. The logo on the left is generated from a single PWM and the logos on the right are generated from multiple 
PWMs using MUSI. (A) MUSI output for SH3 domain (Src). Even though a single logo can visualize certain binding specificity of SH3 domain 
(Class I), multiple PWMs using MUSI reveal both class I and class II binding specificities. (B) MUSI output for Smad3 transcription factor. MUSI 
detects two motifs that are already reported in UniPROBE (PWM1 and PWM3) as well as their reverse complements (PWM2 and PWM4). 



extension to the widely used single logos that can be 
applied in many instances to better describe the specificity 
of peptide recognition or DNA-binding domains. 
With MUSI, we aim to provide a standalone system 
for rapidly processing this kind of data. The algorithm 
takes standard sequence files as input and generates 
multiple PWMs together with the corresponding graphical 
sequence logos. For the use case of mapping peptide 
recognition specificity, MUSI supports a full input pre- 
processing stage for standard 96-well plate data labeled 
by barcodes. As phage-display technology combined 
with next-generation sequencing is currently one of the 
most powerful techniques for identifying peptide-binding 
motifs, we believe that this tool will be very useful for 
researchers in the field. 

The high efficiency and speed of MUSI derives from 
new development and optimization of the method initially 
proposed in (24). In particular, the replacement of the 
clustering step used to identify multiple specificity with a 
mixture model based method to determine the optimal 
number of PWMs allowed us to reduce the complexity 
of the algorithm from 0(N 2 MK 2 ) to 0(NMK 2 \ where N 
is the number of sequences, M their length and K the 
number of PWMs. This enables MUSI to rapidly 
process tens of thousands of sequences. 

Compared with existing motif discovery tools, such as 
YMF (44), and Weeder Web (45), MUSI is optimized for 
short ligands and, consequently, it can handle much larger 
data sets. As such, it is not designed to identify over- 
represented motifs spread out in long sequences, as this is 
for instance the case when looking for a motif within long 
upstream regions of co-expressed genes. Questions like this 
would require more involved and computationally more 
expensive strategies, such as Gibbs sampling (46), which 
does not easily scale with the current data set size. 



This limitation in sequence lengths might prevent straight- 
forward application of MUSI on data generated from 
ChlP-Seq experiments (47), unless peaks can be defined 
at a resolution of 40-50 bp or shorter. So, it would be rec- 
ommended to pre-process the ChlP-Seq data with peak 
finding algorithms such as QuEST (48) to obtain as short 
peaks as possible. However, our tests on different bench- 
marks indicate that MUSI can accommodate sequences 
that are significantly longer than the motif itself. The 
observed limit in sequence length (around 40) might also 
depend on the motif length. Indeed longer motifs convey 
more information and thus are more likely to be correctly 
recognized in both the alignment and the mixture model 
steps. Yet, since most biologically relevant motifs are quite 
short, it is clear that MUSI is particularly suited for experi- 
mental data consisting of relatively short sequences, such as 
the ones coming from phage display or protein binding 
microarrays. 

To summarize, MUSI addresses a need in high- 
throughput and high-resolution data analysis capability 
thus far missing. By mapping all instances of multiple spe- 
cificity, it is not only useful for predicting better protein 
interactions, but reveals some of the fundamental mech- 
anisms of encoding specificity in biological interaction 
networks. Moreover its speed and accuracy ensures 
that it can be used for new data that will be generated in 
future projects relying on high-throughput sequencing. 
Applications may range from synthetic data, such as the 
one produced with phage display technology, to in vivo 
data such as extensive sequencing of variable regions 
found on viral proteins, antibodies or T-cell receptors. 
We expect MUSI to become increasingly useful as 
DNA sequencing and microarray technology continues 
to improve and be applied to identify new protein or 
DNA binding motifs. 
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SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1 and Supplementary Figures 1-6. 
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